Sentiment Analysis

Author
Affiliations

John R Little

Duke University

Modified

December 27, 2023

Find this repository: https://github.com/libjohn/workshop_textmining

Much of this review comes from the site: https://juliasilge.github.io/tidytext/

The primary package, tidytext, supports a wide range of text-mining tasks. See also this helpful free online book: Text Mining with R: A Tidy Approach by Silge and Robinson.

library(janeaustenr)
library(tidyverse)
Warning: package 'dplyr' was built under R version 4.3.2
Warning: package 'stringr' was built under R version 4.3.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tidytext)
Warning: package 'tidytext' was built under R version 4.3.2
library(wordcloud2)
Warning: package 'wordcloud2' was built under R version 4.3.2
library(textdata)
Warning: package 'textdata' was built under R version 4.3.2

Data

We’ll look at some books by Jane Austen, an English novelist whose works were published in the early 19th century. Austen explored women and marriage within the British upper class. The novelist has a unique and well-earned following within literature. Her work is consistently discussed and honored. To this day, Austen’s novels are the source of many adaptations, written and on-screen. Through the janeaustenr package we can access and mine the text of six Austen novels. We can call the collection of novels a corpus (plural: corpora). An individual novel is a document within that corpus.

austen_books()

Austen is best known for six published works:

austen_books() %>% 
  distinct(book)

Data Cleaning

Text mining typically requires a lot of data cleaning. In this case, we start with the janeaustenr collection, which has already been cleaned. Nonetheless, further data wrangling is required. First, we identify a line number for each line of text in each book.

Identify line numbers

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(line = row_number()) %>%         # identify line numbers
  ungroup()

original_books

Tokens

To work with these data as a tidy dataset, we need to restructure the data through tokenization. In our case a token is a single word. We want one-token-per-row. The unnest_tokens() function (tidytext package) will convert a data frame with a text column into the one-token-per-row format.

Token
Tokenization
defined

The default tokenizing mode is “words”. With the unnest_tokens() function, tokens can be: words, characters, character_shingles, ngrams, skip_ngrams, sentences, lines, paragraphs, regex, tweets, and ptb (Penn Treebank).
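As a minimal sketch of how the token mode changes the result, the toy example below tokenizes two lines from the opening of Pride and Prejudice, first as single words and then as bigrams. The `toy` data frame and the output column names are illustrative assumptions, not part of the workshop data.

```r
library(dplyr)
library(tidytext)

# Two lines of text standing in for a book (illustrative data only)
toy <- tibble(line = 1:2,
              text = c("It is a truth universally acknowledged,",
                       "that a single man in possession of a good fortune"))

# Default mode: one lowercase word per row, punctuation stripped
toy_words <- toy %>%
  unnest_tokens(word, text)

# n-gram mode: one pair of adjacent words per row
toy_bigrams <- toy %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

toy_words
toy_bigrams
```

Other columns, such as `line`, are carried along with each token, which is what lets later steps group words back into sections of a book.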

Process

  1. Group by line number (above)
  2. Make each single word a token
tidy_books <- original_books %>%
  unnest_tokens(word, text)

tidy_books

Now that the data is in the one-word-per-row format, we can manipulate it with tidy tools like dplyr.

Stop Words

tidytext::get_stopwords()

Remove stop-words from the books.

matchwords_books <- tidy_books %>%
  anti_join(get_stopwords())
Joining with `by = join_by(word)`
matchwords_books

Join types
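The three joins used in this workshop can be compared on a toy example. This is a sketch with made-up data frames (`words`, `lexicon`, and `stops` are hypothetical names), meant only to show which rows and columns each join keeps.

```r
library(dplyr)

words   <- tibble(word = c("the", "happy", "sad", "farfegnugen"))
lexicon <- tibble(word = c("happy", "sad"),
                  sentiment = c("positive", "negative"))
stops   <- tibble(word = "the")

# anti_join(): keep rows of `words` that have NO match in `stops`
kept <- anti_join(words, stops, by = "word")

# semi_join(): keep rows of `words` that DO have a match; adds no columns
matched <- semi_join(words, lexicon, by = "word")

# inner_join(): keep matching rows AND add the columns from `lexicon`
scored <- inner_join(words, lexicon, by = "word")

kept
matched
scored
```

Passing `by = "word"` explicitly also suppresses the "Joining with" message that appears when the join column is guessed.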

Customize your dictionaries

You can customize stop-words data frames, sentiment data frames, etc.

There are various stop-word dictionaries. Here we add the stop word “farfegnugen” to a custom dictionary. If Jane Austen ever used the word “farfegnugen”, that would be weird, or bad. So we will take pains not to calculate the sentiment of that word, whether or not the term shows up in a sentiment dictionary. That is, we will remove the word by making it part of a customized stop-word dictionary.

stopwords::stopwords_getsources()
[1] "snowball"      "stopwords-iso" "misc"          "smart"        
[5] "marimo"        "ancient"       "nltk"          "perseus"      
stopwords::stopwords_getlanguages("snowball")
 [1] "da" "de" "en" "es" "fi" "fr" "hu" "ir" "it" "nl" "no" "pt" "ro" "ru" "sv"
stopwords_custom <- tribble(~word, ~lexicon,
                            "farfegnugen", "custom")

stopwords_custom
get_stopwords(source = "snowball")
bind_rows(get_stopwords(), stopwords_custom)    # The default is "snowball"
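To show the customized dictionary in use, here is a small sketch that removes both a snowball stop word and the custom word from a toy token list. The names `all_stopwords`, `toy_tokens`, and `toy_clean` are assumptions made for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

stopwords_custom <- tribble(~word, ~lexicon,
                            "farfegnugen", "custom")

# Combine the default (snowball) stop words with the custom entry
all_stopwords <- bind_rows(get_stopwords(), stopwords_custom)

# Toy tokens: one snowball stop word, one custom stop word, one ordinary word
toy_tokens <- tibble(word = c("the", "farfegnugen", "happy"))

toy_clean <- toy_tokens %>%
  anti_join(all_stopwords, by = "word")

toy_clean
```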

Calculate word frequency

How many distinct words remain in the Austen novels after we remove the snowball stop-words? There are 14,375 distinct words.

matchwords_books %>% 
  # distinct(word)
  count(word, sort = TRUE) 
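A toy illustration of what `count(word, sort = TRUE)` returns: one row per distinct word, with the most frequent word first. The token list here is made up for the example.

```r
library(dplyr)

toy_tokens <- tibble(word = c("emma", "love", "emma", "miss", "love", "emma"))

# One row per distinct word; n holds how often it occurs
freq <- toy_tokens %>%
  count(word, sort = TRUE)

freq
nrow(freq)   # number of distinct words
```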

Word clouds

matchwords_books %>%
  count(word, sort = TRUE) %>%
  head(100) %>% 
  wordcloud2(size = .4, shape = 'triangle-forward', 
             color = c("steelblue", "firebrick", "darkorchid"), 
             backgroundColor = "salmon")

Basic word cloud

A non-interactive word cloud.

matchwords_books %>%
  count(word) %>%
  with(wordcloud::wordcloud(word, n, max.words = 100))

Your Turn: Exercise 1

Goal: Make a basic word cloud for the novel Pride and Prejudice, stored as pride_prej_novel

  1. Prepare
pride_prej_novel <- tibble(text = prideprejudice) %>% 
  mutate(line = row_number())
  2. Tokenize pride_prej_novel with unnest_tokens()
  3. Remove stop-words
  4. Calculate word frequency
  5. Make a simple word cloud

Sentiment Analysis

get_sentiments()

Let’s see what positive words exist in the bing dictionary. Then, count the frequency of those positive words that exist in Emma.

positive <- get_sentiments("bing") %>%
  filter(sentiment == "positive")                    # get POSITIVE words

positive 
tidy_books %>%
  filter(book == "Emma") %>%                        # only the book _emma_
  semi_join(positive) %>%                           # semi_join()
  count(word, sort = TRUE)
Joining with `by = join_by(word)`

Prepare to visualize sentiment score

Match all the Austen books to the bing sentiment dictionary. Count the word frequency.

tidy_books %>%
  inner_join(get_sentiments("bing")) %>%
  count(book)
Joining with `by = join_by(word)`
Warning in inner_join(., get_sentiments("bing")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.

Calculate sentiment

Algorithm: sentiment = positive - negative

Define a section of text.

“Small sections of text may not have enough words in them to get a good estimate of sentiment while really large sections can wash out narrative structure. For these books, using 80 lines works well, but this can vary depending on individual texts…” – Text Mining with R
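The arithmetic can be worked through on a handful of rows. In this sketch the `toy` data frame is made up: each row stands for one word that matched the lexicon, with the line it came from and its sentiment label. Integer division by 80 assigns lines 0 to 79 to section 0, lines 80 to 159 to section 1, and so on.

```r
library(dplyr)
library(tidyr)

# Made-up matches: the line each word came from and its sentiment label
toy <- tibble(line = c(1, 5, 79, 80, 81, 160),
              sentiment = c("positive", "positive", "negative",
                            "negative", "negative", "positive"))

toy_scores <- toy %>%
  count(index = line %/% 80, sentiment) %>%          # words per section and label
  pivot_wider(names_from = sentiment,
              values_from = n, values_fill = 0) %>%  # one column per label
  mutate(sentiment = positive - negative)            # net score per section

toy_scores
```

Section 0 has two positive words and one negative word (score 1), section 1 has two negative words (score -2), and section 2 has one positive word (score 1).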

bing <- get_sentiments("bing")

janeaustensentiment <- tidy_books %>% 
  inner_join(bing) %>% 
  count(book, index = line %/% 80, sentiment) %>%                          # `%/%` = int division ; 80 lines / section
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%    # spread(sentiment, n, fill = 0)
  mutate(sentiment = positive - negative)                                      # ALGO!!!
Joining with `by = join_by(word)`
Warning in inner_join(., bing): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
janeaustensentiment
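The many-to-many warning above reports that some rows of the tokenized text match more than one row of the lexicon. One way to investigate, sketched here on the assumption that repeated lexicon entries are the cause, is to look for words that appear more than once in bing. The names `bing_duplicates` and `entries` are illustrative.

```r
library(dplyr)
library(tidytext)

# Words listed more than once in the bing lexicon
bing_duplicates <- get_sentiments("bing") %>%
  count(word, name = "entries") %>%
  filter(entries > 1)

bing_duplicates
```

If the duplicates are expected, adding `relationship = "many-to-many"` to the `inner_join()` call silences the warning, as the message itself suggests.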

Viz it

janeaustensentiment %>%
  ggplot(aes(index, sentiment)) +
  geom_col(show.legend = FALSE, fill = "cadetblue") +
  geom_col(data = . %>% filter(sentiment < 0), show.legend = FALSE, fill = "firebrick") +
  geom_hline(yintercept = 0, color = "goldenrod") +
  facet_wrap(~ book, ncol = 2, scales = "free_x") 

Preparation: Most common positive and negative words

bing_word_counts <- tidy_books %>%
  inner_join(bing) %>%
  count(word, sentiment, sort = TRUE)
Joining with `by = join_by(word)`
Warning in inner_join(., bing): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 435434 of `x` matches multiple rows in `y`.
ℹ Row 5051 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
bing_word_counts

Viz it too

bing_word_counts %>%
  filter(n > 170) %>%
  mutate(n = if_else(sentiment == "negative", - n, n)) %>%
  ggplot(aes(fct_reorder(str_to_title(word), n), n, fill = str_to_title(sentiment))) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(type = "qual") +
  guides(fill = guide_legend(reverse = TRUE)) +
  labs(title = "Frequency of popular positive and negative words",
       subtitle = "Jane Austen novels",
       y = "Compound sentiment score", x = "",
       fill = "Sentiment", caption = "Source: library(janeaustenr)") +
  theme(plot.title.position = "plot")

Dictionaries

What other dictionaries are available? How to choose?

head(get_sentiments("bing"))
head(get_sentiments("loughran"))
head(get_sentiments("nrc"))
head(get_sentiments("afinn"))
get_sentiments("nrc") %>% 
  count(sentiment, sort = TRUE) 

Afinn

What words in Emma match the AFINN dictionary?

emma_afinn <- tidy_books %>%
  filter(book == "Emma") %>% 
  anti_join(get_stopwords()) %>% 
  inner_join(get_sentiments("afinn"))
Joining with `by = join_by(word)`
Joining with `by = join_by(word)`
emma_afinn
emma_afinn %>% 
  count(word, sort = TRUE)

Make Sections

As in the sentiment calculation above, make sections of 80 words, then calculate the sentiment of each section.

emma_afinn_sentiment <- emma_afinn %>% 
  mutate(word_count = 1:n(),
         index = word_count %/% 80) %>% 
  group_by(index) %>% 
  summarise(sentiment = sum(value))           ## ALGO sum each Afinn score in the 80 word section

emma_afinn_sentiment
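The same sectioning logic can be checked by hand on a small example. This sketch uses a section size of 3 instead of 80, and the scores are made up for illustration rather than taken from the AFINN lexicon.

```r
library(dplyr)

# Made-up words and scores, standing in for AFINN matches
toy_afinn <- tibble(word  = c("good", "bad", "love", "poor", "sorry", "kind", "hope"),
                    value = c(2, -1, 3, -2, -2, 1, 1))

toy_sections <- toy_afinn %>%
  mutate(word_count = row_number(),      # running count of matched words
         index = word_count %/% 3) %>%   # section number (size 3 here)
  group_by(index) %>%
  summarise(sentiment = sum(value))      # sum of scores in each section

toy_sections
```

Because the running count starts at 1, the first section holds one word fewer than the others (here words 1 and 2), a detail that applies equally to the 80-word sections.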

Viz it

emma_afinn %>% 
  mutate(word_count = 1:n(),
         index = word_count %/% 80) %>% 
  filter(index == 104) %>%
  count(word, sort = TRUE) %>%
  with(wordcloud::wordcloud(word, n, 
                            rot.per = .3))

emma_afinn %>% 
  mutate(word_count = 1:n(),
         index = word_count %/% 80) %>% 
  filter(index == 104) %>%
  count(word, sort = TRUE) %>%
  wordcloud2(size = .4, shape = 'diamond',
             backgroundColor = "darkseagreen")
emma_afinn_sentiment %>% 
  ggplot(aes(index, sentiment)) +
  geom_col(aes(fill = cut_interval(sentiment, n = 5))) +
  geom_hline(yintercept = 0, color = "forestgreen", linetype = "dashed") +
  scale_fill_brewer(palette = "RdBu", guide = "none") +    # guide = FALSE is deprecated since ggplot2 3.3.4
  theme(panel.background = element_rect(fill = "grey"),
        plot.background = element_rect(fill = "grey"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  labs(title = "Afinn Sentiment Analysis of _Emma_")

emma_afinn %>%
  mutate(word_count = 1:n(),
         index = as.character(word_count %/% 80)) %>%
  filter(index %in% c("10", "104", "105")) %>%      # index is a character column here
  ggplot(aes(value, index)) +
  geom_boxplot() +
  # geom_boxplot(notch = TRUE) +
  geom_jitter() +
  coord_flip() +
  labs(y = "section", x = "Afinn")

Resources